## CSC 2224: Parallel Computer Architecture and Programming Main Memory. DRAM.

Prof. Gennady Pekhimenko, University of Toronto, Fall 2022

The content of this lecture is adapted from the slides of Vivek Seshadri, Donghyuk Lee, Yoongu Kim, and lectures of Onur Mutlu @ ETH and CMU

## Outline

#### 1. What is DRAM?

#### 2. DRAM Internal Organization

- DRAM Cell
- DRAM Array
- DRAM Bank
#### 3. Problems and Solutions

- Latency (Tiered-Latency DRAM, HPCA 2013;
   Adaptive-Latency DRAM, HPCA 2015)
- Parallelism (Subarray-level Parallelism, ISCA 2012)

#### **DRAM Bank**



## How to build a DRAM bank from a DRAM array?

#### **DRAM Bank: Single DRAM Array?**



#### **DRAM Bank: Collection of Arrays**



#### **DRAM Operation: Summary**



#### **DRAM Chip Hierarchy**



**Collection of Subarrays** 

## Outline

1. What is DRAM?

2. DRAM Internal Organization

#### 3. Problems and Solutions

- Latency (Tiered-Latency DRAM, HPCA 2013;
   Adaptive-Latency DRAM, HPCA 2015)
- Parallelism (Subarray-level Parallelism, ISCA 2012)

#### **Factors That Affect Performance**

- 1. Latency
  - How fast can DRAM serve a request?

- 2. Parallelism
  - How many requests can DRAM serve in parallel?

### **DRAM Chip Hierarchy**



**Collection of Subarrays** 

## Outline

- 1. What is DRAM?
- 2. DRAM Internal Organization
- 3. Problems and Solutions
  - Latency (Tiered-Latency DRAM, HPCA 2013;
     Adaptive-Latency DRAM, HPCA 2015)
  - Parallelism (Subarray-level Parallelism, ISCA 2012)

#### Subarray Size: Rows/Subarray



#### Subarray Size vs. Access Latency



#### Smaller subarrays => lower access latency

#### Subarray Size vs. Chip Area

Large Subarray



**Smaller Subarrays** 



#### Smaller subarrays => larger chip area

#### **Chip Area vs. Access Latency**





How to enable low latency without high area overhead?





**Small Subarray** 

#### **Tiered-Latency DRAM**

Far Segment

- Higher access latency
- Higher energy/access

Near Segment

+ Lower access latency
+ Lower energy/access

Map frequently accessed data to near segment
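A minimal sketch of one way to choose which rows go to the near segment, assuming a simple profile-driven policy: count accesses per row and keep the hottest rows near. The function name `pick_near_segment_rows`, the access trace, and the near-segment capacity are hypothetical illustrations, not the mechanism evaluated in the TL-DRAM paper.

```python
from collections import Counter


def pick_near_segment_rows(access_trace, near_capacity):
    """Return the set of row IDs to place in the near segment.

    access_trace: iterable of row IDs, one per DRAM access (hypothetical trace).
    near_capacity: number of rows the near segment can hold (assumed value).
    """
    counts = Counter(access_trace)
    # most_common() orders rows by descending access count.
    hottest = counts.most_common(near_capacity)
    return {row for row, _ in hottest}


trace = [7, 3, 7, 7, 1, 3, 9, 7, 3, 2]
near_rows = pick_near_segment_rows(trace, near_capacity=2)
print(sorted(near_rows))
```

Rows 7 and 3 dominate this toy trace, so they are the ones selected for the low-latency segment; all other rows stay in the far segment.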

#### **Results Summary**



Tiered-Latency DRAM



#### Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture

Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu

Published in the proceedings of 19<sup>th</sup> IEEE International Symposium on **High Performance Computer Architecture 2013**

#### **DRAM Stores Data as Charge**

## Three steps of charge movement

- 1. Sensing
- 2. Restore
- 3. Precharge



#### **DRAM Charge over Time**



Why does DRAM need the extra timing margin?

#### **Two Reasons for Timing Margin**

#### 1. Process Variation

- DRAM cells are not equal
- Leads to extra timing margin for cells that can store a large amount of charge

#### 2. Temperature Dependence

#### **DRAM Cells are Not Equal**





Large variation in cell size $\rightarrow$ large variation in charge $\rightarrow$ large variation in access latency

#### **Two Reasons for Timing Margin**

- 1. Process Variation
  - DRAM cells are not equal
  - Leads to extra timing margin for cells that can store a large amount of charge

#### 2. Temperature Dependence

- DRAM leaks more charge at higher temperature
- Leads to extra timing margin when operating at low temperature

#### **Charge Leakage ∝ Temperature**



Cells store small charge at high temperature and large charge at low temperature $\rightarrow$ large variation in access latency

#### **DRAM Timing Parameters**

- DRAM timing parameters are dictated by the worst case
  - The smallest cell with the smallest charge in all DRAM products
  - Operating at the highest temperature
- Large timing margin for the common case $\rightarrow$ can lower latency for the common case

#### DRAM Testing Infrastructure















#### **Obs 1. Faster Sensing**

#### Typical DIMM at Low Temperature



115 DIMM characterization

More charge $\rightarrow$ strong charge flow $\rightarrow$ faster sensing

**Timing** (tRCD): 17% ↓, **no errors**

Typical DIMM at low temperature $\rightarrow$ *more charge* $\rightarrow$ *faster sensing*

## **Obs 2. Reducing Restore Time**

Typical DIMM at Low Temperature



Larger cell & less leakage $\rightarrow$ extra charge $\rightarrow$ no need to fully restore charge

115 DIMM characterization

- **Read** (tRAS): **37%** ↓
- Write (tWR): 54% ↓
- No errors

Typical DIMM at lower temperature $\rightarrow$ <u>more charge</u> $\rightarrow$ restore time reduction

### **Obs 3. Reducing Precharge Time**

#### Typical DIMM at Low Temperature





Sense amplifier

Precharge: setting the bitline to half-full charge

Typical DIMM at lower temperature $\rightarrow$ more charge $\rightarrow$ precharge time reduction

#### **Adaptive-Latency DRAM**

- Key idea
  - Optimize DRAM timing parameters online
- Two components
  - Multiple sets of reliable DRAM timing parameters for different temperatures
  - System monitors DRAM temperature & uses appropriate DRAM timing parameters
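The two components can be sketched as a lookup table plus a selection step. This is only an illustration under stated assumptions: the table `TIMING_PROFILE`, the function `select_timings`, the temperature thresholds, and every timing value are hypothetical placeholders, not measured data or values from the AL-DRAM paper.

```python
import bisect

# Hypothetical per-DIMM profile: (maximum temperature in Celsius, timings in ns).
# All numbers are illustrative placeholders, sorted by ascending temperature.
TIMING_PROFILE = [
    (55, {"tRCD": 10.0, "tRAS": 24.0, "tWR": 8.0, "tRP": 10.0}),
    (70, {"tRCD": 12.0, "tRAS": 30.0, "tWR": 12.0, "tRP": 12.0}),
    (85, {"tRCD": 13.75, "tRAS": 35.0, "tWR": 15.0, "tRP": 13.75}),
]


def select_timings(temperature_c):
    """Pick the fastest timing set that is listed as reliable at this temperature."""
    limits = [limit for limit, _ in TIMING_PROFILE]
    idx = bisect.bisect_left(limits, temperature_c)
    # Above the highest profiled temperature, fall back to the most
    # conservative (last) set in this sketch.
    idx = min(idx, len(TIMING_PROFILE) - 1)
    return TIMING_PROFILE[idx][1]


print(select_timings(40))   # coolest range: shortest timings
print(select_timings(80))   # hottest range: most conservative timings
```

A memory controller following this idea would re-run the selection whenever the monitored temperature crosses a threshold, so the common low-temperature case runs with the shorter timings.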

#### **Real System Evaluation**

[Figure: average performance improvement (%) per workload: libq, mcf, milc, copy, gups, gems, soplex, lbm, s.cluster, and the intensive, non-intensive, and all-35-workload averages]

AL-DRAM provides high performance improvement, greater for multi-core workloads

#### **Summary: AL-DRAM**

- Observation
  - DRAM timing parameters are dictated by the worst-case cell (smallest cell at highest temperature)
- Our Approach: Adaptive-Latency DRAM (AL-DRAM)
  - Optimizes DRAM timing parameters for the common case (typical DIMM operating at low temperatures)
- Analysis: Characterization of 115 DIMMs
  - Great potential to *lower DRAM timing parameters* (17–54%) without any errors
- Real System Performance Evaluation
  - Significant *performance improvement* (14% for memory-intensive workloads) without errors (33 days)

# Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case

Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, and Onur Mutlu

Published in the proceedings of 21<sup>st</sup> International Symposium on High Performance Computer Architecture 2015

# Outline

- 1. What is DRAM?
- 2. DRAM Internal Organization
- 3. Problems and Solutions
  - Latency (Tiered-Latency DRAM, HPCA 2013;
     Adaptive-Latency DRAM, HPCA 2015)
  - Parallelism (Subarray-level Parallelism, ISCA 2012)



## **Increasing Number of Banks?**



Adding more banks $\rightarrow$ Replication of shared structures

Replication $\rightarrow$ Cost

How to improve available parallelism within DRAM?

# **Our Observation**

Local to a subarray



Time

## **Subarray-Level Parallelism**



## **Subarray-Level Parallelism: Benefits**



**Subarray-Level Parallelism** 

# **Results Summary**

Commodity DRAM

#### Subarray-Level Parallelism



## A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM

Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

Published in the proceedings of 39<sup>th</sup> International Symposium on Computer Architecture 2012

# CSC 2224: Parallel Computer Architecture and Programming Main Memory Fundamentals

Prof. Gennady Pekhimenko, University of Toronto, Fall 2022

The content of this lecture is adapted from the slides of Vivek Seshadri, Donghyuk Lee, Yoongu Kim, and lectures of Onur Mutlu @ ETH and CMU

## **Review #5**

#### <u>Flipping Bits in Memory Without Accessing Them</u>

#### Yoongu Kim et al., ISCA 2014

# **Review: Memory Latency Lags Behind**

[Figure: DRAM improvement (log scale), with annotated improvements of 128x and 1.3x]

#### Memory latency remains almost constant

# We Need A Paradigm Shift To ...

• Enable computation with minimal data movement

• Compute where it makes sense (where data resides)

• Make computing architectures more data-centric



## Why In-Memory Computation Today?



- Pull from Systems and Applications
  - Data access is a major system and application bottleneck
  - Systems are energy limited
  - Data movement much more energy-hungry than computation

## **Two Approaches to In-Memory Processing**

- 1. Minimally change DRAM to enable simple yet powerful computation primitives
  - <u>RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data</u> (Seshadri et al., MICRO 2013)
  - Fast Bulk Bitwise AND and OR in DRAM (Seshadri et al., IEEE CAL 2015)
  - <u>Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses</u> (Seshadri et al., MICRO 2015)
- 2. Exploit the control logic in 3D-stacked memory to enable more comprehensive computation near memory
  - <u>PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture</u> (Ahn et al., ISCA 2015)
  - <u>A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing</u> (Ahn et al., ISCA 2015)
  - <u>Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation</u> (Hsieh et al., ICCD 2016)

## **Approach 1: Minimally Changing DRAM**

- DRAM has great capability to perform bulk data movement and computation internally with small changes
  - Can exploit internal bandwidth to move data
  - Can exploit analog computation capability

- Examples: RowClone, In-DRAM AND/OR, Gather/Scatter DRAM
  - <u>RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data</u> (Seshadri et al., MICRO 2013)
  - Fast Bulk Bitwise AND and OR in DRAM (Seshadri et al., IEEE CAL 2015)
  - <u>Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses</u> (Seshadri et al., MICRO 2015)

#### **Starting Simple: Data Copy and Initialization**



# **Bulk Data Copy and Initialization**

- The Impact of Architectural Trends on Operating System Performance (Mendel Rosenblum, Edouard Bugnion, Stephen Alan Herrod, Emmett Witchel, and Anoop Gupta)
- Hardware Support for Bulk Data Movement in Server Platforms (Li Zhao, Ravi Iyer, Srihari Makineni, Laxmi Bhuyan, and Don Newell; University of California, Riverside, and Intel)
- Architecture Support for Improving Bulk Memory Copying and Initialization Performance (Li Zhao, Ravishankar Iyer, Xiaowei Jiang, Yan Solihin; Intel Labs, Hillsboro, USA, and North Carolina State University, Raleigh, USA)

# **Bulk Data Copy and Initialization**

*memmove* & *memcpy*: 5% of cycles in Google's datacenter [ISCA'15]





VM Cloning, Page Migration, Deduplication, and many more

## Today's Systems: Bulk Data Copy



## **Future Systems: In-Memory Copy**



#### **RowClone: In-DRAM Row Copy**



Data Bus





#### Row Buffer

- 1. Activate src row (copy data from src to row buffer)
- 2. Activate dst row (disconnect src from row buffer, connect dst; copy data from row buffer to dst)
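The two activations can be mimicked with a toy software model. This is a sketch under simplifying assumptions: the `Subarray` class, its sizes, and the `rowclone_copy` helper are hypothetical names for illustration, and real DRAM timing and analog behavior are not modeled.

```python
class Subarray:
    """Toy model of one DRAM subarray with a shared row buffer (assumed sizes)."""

    def __init__(self, num_rows, row_bytes):
        self.rows = [bytearray(row_bytes) for _ in range(num_rows)]
        self.row_buffer = None  # None models a precharged (empty) row buffer

    def activate(self, row):
        if self.row_buffer is None:
            # First activation: the row buffer latches the row's data.
            self.row_buffer = bytearray(self.rows[row])
        else:
            # Back-to-back activation without precharge: the row buffer
            # contents overwrite the newly connected row.
            self.rows[row][:] = self.row_buffer

    def precharge(self):
        self.row_buffer = None


def rowclone_copy(subarray, src, dst):
    """Copy row src to row dst inside the subarray; no data bus transfer."""
    subarray.activate(src)   # step 1: src -> row buffer
    subarray.activate(dst)   # step 2: row buffer -> dst
    subarray.precharge()


sa = Subarray(num_rows=4, row_bytes=8)
sa.rows[0][:] = b"DRAMDATA"
rowclone_copy(sa, src=0, dst=2)
print(bytes(sa.rows[2]))
```

The point of the model is that the copy never leaves the subarray: the data moves from the source row to the row buffer and then to the destination row, without crossing the data bus.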

# **RowClone: Inter-Bank**



#### **Generalized RowClone**

#### 0.01% area cost



## **RowClone: Fast Row Initialization**



Fix a row at Zero (0.5% loss in capacity)

## **RowClone: Bulk Initialization**

- Initialization with arbitrary data
  - Initialize one row
  - Copy the data to other rows
- Zero initialization (most common)
  - Reserve a row in each subarray (always zero)
  - Copy data from reserved row (FPM mode)
  - 6.0X lower latency, 41.5X lower DRAM energy
  - 0.2% loss in capacity
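As a quick check on the capacity figure, the loss from reserving one zero row per subarray is just the reserved fraction of rows. The subarray size of 512 rows is an assumption made for this sketch (the slides do not state it), and `reserved_row_capacity_loss` is a hypothetical helper.

```python
def reserved_row_capacity_loss(rows_per_subarray, reserved_rows=1):
    """Fraction of capacity lost by reserving rows in every subarray."""
    return reserved_rows / rows_per_subarray


# Assumed subarray size: 512 rows, with one row fixed at zero.
loss = reserved_row_capacity_loss(rows_per_subarray=512)
print(f"{loss:.2%}")
```

Under that assumption the result rounds to about 0.2%, in line with the loss quoted in the list above.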

#### **RowClone: Latency & Energy Benefits**



## **Copy and Initialization in Workloads**



## **RowClone: Application Performance**



# **End-to-End System Design**

- Application
- **Operating System**
- ISA
- Microarchitecture
- DRAM (RowClone)

How to communicate occurrences of bulk copy/initialization across layers?

How to ensure cache coherence?

How to maximize latency and energy savings?

How to handle data reuse?

# Ambit

In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology

Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, Todd C. Mowry

SAFARI, Carnegie Mellon

# **Executive Summary**

- Problem: Bulk bitwise operations
  - present in many applications, e.g., databases, search filters
  - existing systems are memory bandwidth limited
- Our Proposal: Ambit
  - perform bulk bitwise operations completely inside DRAM
  - bulk bitwise AND/OR: simultaneous activation of three rows
  - bulk bitwise NOT: inverters already in sense amplifiers
  - less than 1% area overhead over existing DRAM chips
- Results compared to state-of-the-art baseline
  - average across seven bulk bitwise operations
    - 32X performance improvement, 35X energy reduction
  - 3X-7X performance for real-world data-intensive applications



[1] Li and Patel, BitWeaving, SIGMOD 2013

[2] Goodwin+, BitFunnel, SIGIR 2017

## Today, DRAM is just a storage device!



Throughput of bulk bitwise operations limited by available memory bandwidth






## **DRAM Cell Operation**





#### **Triple-Row Activation: Majority Function**




#### **Bitwise AND/OR Using Triple-Row Activation**



#### **Bitwise AND/OR Using Triple-Row Activation**




#### RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons<sup>†</sup>, Michael A. Kozuch<sup>†</sup>, Todd C. Mowry

Carnegie Mellon University, <sup>†</sup>Intel Pittsburgh

**MICRO 2013**

## **Bulk Bitwise AND/OR in DRAM**

Statically reserve three designated rows t1, t2, and t3

#### Result = row A AND/OR row B

- 1. Copy data of row A to row t1
- 2. Copy data of row B to row t2
- 3. Initialize data of row t3 to 0/1
- 4. Activate rows t1/t2/t3 simultaneously
- 5. Copy data of rows t1/t2/t3 to Result row
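The five steps can be checked in software by modeling triple-row activation as a bitwise majority of three rows. This is a functional sketch only: rows are represented as Python integers, and `majority` and `ambit_and_or` are hypothetical helper names, not part of any real interface.

```python
def majority(a, b, c):
    """Bitwise majority of three equal-width rows held as Python ints."""
    return (a & b) | (b & c) | (a & c)


def ambit_and_or(row_a, row_b, op, width):
    """Follow the five steps using three temporary rows t1, t2, and t3."""
    if op not in ("AND", "OR"):
        raise ValueError("op must be 'AND' or 'OR'")
    all_ones = (1 << width) - 1
    t1 = row_a                            # step 1: copy row A to t1
    t2 = row_b                            # step 2: copy row B to t2
    t3 = 0 if op == "AND" else all_ones   # step 3: control row of all 0s or all 1s
    result = majority(t1, t2, t3)         # step 4: activate t1/t2/t3 together
    return result                         # step 5: copy to the result row


a, b = 0b1100, 0b1010
print(bin(ambit_and_or(a, b, "AND", width=4)))
print(bin(ambit_and_or(a, b, "OR", width=4)))
```

With the control row at all zeros, the majority reduces to A AND B; with all ones, it reduces to A OR B, which is why one mechanism covers both operations.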

# Use RowClone to perform copy and initialization operations completely in DRAM!







#### Ambit vs. DDR3: Performance and Energy

[Figure panels: Performance Improvement; Energy Reduction]



### Integrating Ambit with the System

#### 1. PCIe device

- Similar to other accelerators (e.g., GPU)

#### 2. System memory bus

- Ambit uses the same DRAM command/address interface

#### Pros and cons discussed in paper (Section 5.4)

## **Real-world Applications**

- Methodology (Gem5 simulator)
  - Processor: x86, 4 GHz, out-of-order, 64-entry instruction queue
  - L1 cache: 32 KB D-cache and 32 KB I-cache, LRU policy
  - L2 cache: 2 MB, LRU policy
  - Memory controller: FR-FCFS, 8 KB row size
  - Main memory: DDR4-2400, 1 channel, 1 rank, 8 banks

#### Workloads

- Database bitmap indices
- BitWeaving column scans using bulk bitwise operations
- Set operations comparing bitvectors with red-black trees
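To make the bitmap-index workload concrete, here is a small sketch of the kind of query that reduces to bulk bitwise operations. The weekly-activity scenario, the user IDs, and the helper `bitmap_from_ids` are hypothetical; Python integers stand in for the bitvectors that Ambit would process inside DRAM.

```python
def bitmap_from_ids(ids):
    """Build a bitvector (Python int) with bit i set for each ID i."""
    bits = 0
    for i in ids:
        bits |= 1 << i
    return bits


# Hypothetical index: one bitmap per week, bit i set if user i was active.
week1 = bitmap_from_ids([0, 2, 3, 5])
week2 = bitmap_from_ids([2, 3, 4])
week3 = bitmap_from_ids([1, 2, 3, 5])

active_every_week = week1 & week2 & week3   # bulk bitwise AND
active_any_week = week1 | week2 | week3     # bulk bitwise OR

print(bin(active_every_week).count("1"))    # users active in all three weeks
print(bin(active_any_week).count("1"))      # users active in at least one week
```

On a real table each bitmap spans many DRAM rows, so the AND and OR steps are the part that would map onto in-DRAM bulk bitwise operations.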

## **Bitmap Indices: Performance**



Consistent reduction in execution time: 6X on average

## Speedup offered by Ambit for BitWeaving

`select count(*) where c1 < field < c2`

[Figure: speedup offered by Ambit versus number of rows in the database table]

## **Review #5**

#### <u>Flipping Bits in Memory Without Accessing Them</u>

#### Yoongu Kim et al., ISCA 2014

## CSC 2224: Parallel Computer Architecture and Programming Advanced Memory

Prof. Gennady Pekhimenko, University of Toronto, Fall 2022

The content of this lecture is adapted from the slides of Vivek Seshadri, Yoongu Kim, and lectures of Onur Mutlu @ ETH and CMU